Design an agent to fly a quadcopter, and then train it using a reinforcement learning algorithm of your choice!
Try to apply the techniques you have learnt, but also feel free to come up with innovative ideas and test them.
Take a look at the files in the directory to better understand the structure of the project.
- task.py: Define your task (environment) in this file.
- agents/: Folder containing reinforcement learning agents.
  - policy_search.py: A sample agent has been provided here.
  - agent.py: Develop your agent here.
- physics_sim.py: This file contains the simulator for the quadcopter. DO NOT MODIFY THIS FILE.

For this project, you will define your own task in task.py. Although we have provided an example task to get you started, you are encouraged to change it. Later in this notebook, you will learn more about how to amend this file.
You will also design a reinforcement learning agent in agent.py to complete your chosen task.
You are welcome to create any additional files to help you to organize your code. For instance, you may find it useful to define a model.py file defining any needed neural network architectures.
We provide a sample agent in the code cell below to show you how to use the sim to control the quadcopter. This agent is even simpler than the sample agent that you'll examine (in agents/policy_search.py) later in this notebook!
The agent controls the quadcopter by setting the revolutions per second on each of its four rotors. The provided agent in the Basic_Agent class below always selects a random action for each of the four rotors. These four speeds are returned by the act method as a list of four floating-point numbers.
For this project, the agent that you will implement in agents/agent.py will have a far more intelligent method for selecting actions!
import random
class Basic_Agent():
    def __init__(self, task):
        self.task = task

    def act(self):
        new_thrust = random.gauss(450., 25.)
        return [new_thrust + random.gauss(0., 1.) for x in range(4)]
Run the code cell below to have the agent select actions to control the quadcopter.
Feel free to change the provided values of runtime, init_pose, init_velocities, and init_angle_velocities below to change the starting conditions of the quadcopter.
The labels list below annotates statistics that are saved while running the simulation. All of this information is saved in a text file data.txt and stored in the dictionary results.
%load_ext autoreload
%autoreload 2
import csv
import numpy as np
from task import Task
import matplotlib.pyplot as plt
%matplotlib inline
import utils
# Modify the values below to give the quadcopter a different starting position.
runtime = 5. # time limit of the episode
init_pose = np.array([0., 0., 10., 0., 0., 0.]) # initial pose
init_velocities = np.array([0., 0., 0.]) # initial velocities
init_angle_velocities = np.array([0., 0., 0.]) # initial angle velocities
file_output = 'data.txt' # file name for saved results
# Setup
task = Task(init_pose, init_velocities, init_angle_velocities, runtime)
agent = Basic_Agent(task)
done = False
labels = ['time', 'x', 'y', 'z', 'phi', 'theta', 'psi', 'x_velocity',
          'y_velocity', 'z_velocity', 'phi_velocity', 'theta_velocity',
          'psi_velocity', 'rotor_speed1', 'rotor_speed2', 'rotor_speed3',
          'rotor_speed4', 'reward']
results = {x: [] for x in labels}
setPlot = utils.SetupPlot('data1.txt', task)
# Run the simulation, and save the results.
with open(file_output, 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(labels)
    while True:
        rotor_speeds = agent.act()  # the rotor speeds are the actions
        next_state, reward, done = task.step(rotor_speeds)  # returns next_state, reward, done
        to_write = [task.sim.time] + list(task.sim.pose) + list(task.sim.v) \
                   + list(task.sim.angular_v) + list(rotor_speeds) + [reward]
        results1 = setPlot.writeResults(reward, rotor_speeds)
        for ii in range(len(labels)):
            results[labels[ii]].append(to_write[ii])
        writer.writerow(to_write)
        if done:
            break
utils.plot_run(results)
utils.plot_run(results1)
When specifying a task, you will derive the environment state from the simulator. Run the code cell below to print the values of the following variables at the end of the simulation:
- task.sim.pose (the position of the quadcopter in ($x,y,z$) dimensions and the Euler angles),
- task.sim.v (the velocity of the quadcopter in ($x,y,z$) dimensions), and
- task.sim.angular_v (radians/second for each of the three Euler angles).

# the pose, velocity, and angular velocity of the quadcopter at the end of the episode
print("Position: ", task.sim.pose[:3])
print("Orientation: ", task.sim.pose[3:])
print("Velocity: ", task.sim.v)
print("Angular vel: ", task.sim.angular_v)
In the sample task in task.py, we use the 6-dimensional pose of the quadcopter to construct the state of the environment at each timestep. However, when amending the task for your purposes, you are welcome to expand the size of the state vector by including the velocity information. You can use any combination of the pose, velocity, and angular velocity - feel free to tinker here, and construct the state to suit your task.
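For instance, a state that folds in the velocity information might be built like this sketch (the helper name and the 12-dimensional layout are illustrative assumptions, not the provided task's code):

```python
import numpy as np

def build_state(pose, v, angular_v, action_repeat=3):
    # Concatenate pose (6), velocity (3), and angular velocity (3) into a
    # 12-dimensional snapshot, then tile it once per repeated sim step.
    single = np.concatenate([pose, v, angular_v])
    return np.tile(single, action_repeat)

state = build_state(np.zeros(6), np.zeros(3), np.zeros(3))
print(state.shape)  # (36,)
```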
A sample task has been provided for you in task.py. Open this file in a new window now.
The __init__() method is used to initialize several variables that are needed to specify the task.
- The simulator is initialized as an instance of the PhysicsSim class (from physics_sim.py).
- We make use of action repeats: for each timestep of the agent, we step the simulation action_repeats timesteps. If you are not familiar with action repeats, please read the Results section in the DDPG paper.
- To set the size of the state (state_size), we must take action repeats into account.
- The environment will always have a 4-dimensional action space, with one entry for each rotor (action_size=4). You can set the minimum (action_low) and maximum (action_high) values of each entry here.

The reset() method resets the simulator. The agent should call this method every time the episode ends. You can see an example of this in the code cell below.
The step() method is perhaps the most important. It accepts the agent's choice of action rotor_speeds, which is used to prepare the next state to pass on to the agent. Then, the reward is computed from get_reward(). The episode is considered done if the time limit has been exceeded, or the quadcopter has travelled outside of the bounds of the simulation.
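In outline, a step() of that shape can be sketched as follows (the reward formula and the StubSim stand-in are assumptions for illustration; the real task steps PhysicsSim):

```python
import numpy as np

class StubSim:
    """Minimal stand-in for PhysicsSim, just enough to exercise the sketch."""
    def __init__(self, pose, runtime=5.):
        self.pose = np.array(pose, dtype=float)
        self.time, self.runtime = 0.0, runtime

    def next_timestep(self, rotor_speeds):
        self.time += 1. / 50.            # pretend one physics tick passed
        return self.time > self.runtime  # done once the time limit is exceeded

class SketchTask:
    def __init__(self, target_pos, action_repeat=3):
        self.target_pos = np.array(target_pos, dtype=float)
        self.action_repeat = action_repeat

    def get_reward(self, pose):
        # Illustrative reward: 1 minus a scaled distance from the target.
        return 1.0 - 0.3 * np.abs(self.target_pos - pose[:3]).sum()

    def step(self, rotor_speeds, sim):
        reward, pose_all = 0.0, []
        for _ in range(self.action_repeat):
            done = sim.next_timestep(rotor_speeds)  # advance the simulation
            reward += self.get_reward(sim.pose)
            pose_all.append(sim.pose)
        next_state = np.concatenate(pose_all)       # pose repeated per step
        return next_state, reward, done

task_sketch = SketchTask(target_pos=[0., 0., 10.])
sim = StubSim([0., 0., 10., 0., 0., 0.])
next_state, reward, done = task_sketch.step([400.] * 4, sim)
print(next_state.shape, reward, done)  # (18,) 3.0 False
```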
In the next section, you will learn how to test the performance of an agent on this task.
The sample agent given in agents/policy_search.py uses a very simplistic linear policy to directly compute the action vector as a dot product of the state vector and a matrix of weights. Then, it randomly perturbs the parameters by adding some Gaussian noise, to produce a different policy. Based on the average reward obtained in each episode (score), it keeps track of the best set of parameters found so far, how the score is changing, and accordingly tweaks a scaling factor to widen or tighten the noise.
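In outline, that linear policy and its noisy hill-climbing update look roughly like this (the shapes mirror the sample agent, but the exact constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
state_size, action_size = 18, 4        # pose x action repeats, four rotors
action_low, action_high = 0.0, 900.0

# Linear policy: the action vector is a dot product of state and weights.
w = rng.normal(size=(state_size, action_size),
               scale=(action_high - action_low) / (2 * state_size))

def act(state):
    return np.dot(state, w)  # one speed per rotor

# After each episode, perturb the weights with Gaussian noise; keep the
# best-scoring weights so far and widen or tighten noise_scale depending
# on whether the score is improving.
noise_scale = 0.1
w_candidate = w + noise_scale * rng.normal(size=w.shape)
```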
Run the code cell below to see how the agent performs on the sample task.
import sys
import pandas as pd
from agents.policy_search import PolicySearch_Agent
from task import Task
from utils import SetupPlot
num_episodes = 1000
init_pose = np.array([0.0, 0.0, 10.0, 0.0, 0.0, 0.0])
target_pos = np.array([0., 0., 10.])
task = Task(init_pose=init_pose, target_pos=target_pos)
agent = PolicySearch_Agent(task)
for i_episode in range(1, num_episodes+1):
    setPlot = SetupPlot("data1.txt", task)
    state = agent.reset_episode()  # start a new episode
    while True:
        action = agent.act(state)
        next_state, reward, done = task.step(action)
        results = setPlot.writeResults(reward, action)
        agent.step(reward, done)
        state = next_state
        if done:
            print("\rEpisode = {:4d}, score = {:7.3f} (best = {:7.3f}), noise_scale = {}".format(
                i_episode, agent.score, agent.best_score, agent.noise_scale), end="")  # [debug]
            break
    sys.stdout.flush()
utils.plot_run(results)
This agent should perform very poorly on this task. And that's where you come in!
Amend task.py to specify a task of your choosing. If you're unsure what kind of task to specify, you may like to teach your quadcopter to takeoff, hover in place, land softly, or reach a target pose.
After specifying your task, use the sample agent in agents/policy_search.py as a template to define your own agent in agents/agent.py. You can borrow whatever you need from the sample agent, including ideas on how you might modularize your code (using helper methods like act(), learn(), reset_episode(), etc.).
Note that it is highly unlikely that the first agent and task that you specify will learn well. You will likely have to tweak various hyperparameters and the reward function for your task until you arrive at reasonably good behavior.
As you develop your agent, it's important to keep an eye on how it's performing. Use the code above as inspiration to build in a mechanism to log/save the total rewards obtained in each episode to file. If the episode rewards are gradually increasing, this is an indication that your agent is learning.
## TODO: Train your agent here.
import sys
import numpy as np
import pandas as pd
from agents.agent import DDPG
import utils
from task import Task
from utils import SetupPlot
%load_ext autoreload
%autoreload 2
rewards_list=[]
num_episodes = 400
init_pose = np.array([0.0, 0.0, 10.0, 0.0, 0.0, 0.0])
target_pos = np.array([0., 0., 10.])
task = Task(init_pose=init_pose, target_pos=target_pos)
agent = DDPG(task)
for i_episode in range(1, num_episodes+1):
    setPlot = SetupPlot("data1.txt", task)
    state = agent.reset_episode()  # start a new episode
    while True:
        action = agent.act(state)
        next_state, reward, done = task.step(action)
        if reward < -1:
            # cut the episode short with a large penalty when the drone drifts too far
            done = True
            reward = -1000.0
        resultsTemp = setPlot.writeResults(reward, action)
        agent.step(action, reward, next_state, done)
        state = next_state
        # print("rewards:{}, action:{}".format(reward, action))
        sys.stdout.flush()
        if done:
            results = resultsTemp
            print("\rEpisode = {:4d}, score = {:7.3f} (best = {:7.3f})".format(
                i_episode, agent.score, agent.best_score), end="")  # [debug]
            utils.plot_run(results)
            rewards_list.append(agent.score)
            break
    if agent.score >= 245:  # consider the hover task complete
        break
    sys.stdout.flush()
Once you are satisfied with your performance, plot the episode rewards, either from a single run, or averaged over multiple runs.
import utils
## TODO: Plot the rewards.
utils.plot_run(results)
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(18,10))
plt.plot(list(range(1,len(rewards_list)+1)), rewards_list, label='reward/episode')
plt.legend()
plt.xlabel('episode', fontsize=20)
plt.ylabel('reward',fontsize=20)
_ = plt.ylim()
Question 1: Describe the task that you specified in task.py. How did you design the reward function?
Answer:
The task designed in task.py is a hovering task: the objective is to maintain a given altitude for the duration of the episode, which is by default 5 seconds. The drone starts at the same altitude as the target position, with zero linear and angular velocities, and the rotor-speed action range is 382 to 445 rad/sec.

The reward function favors reducing the difference between the target altitude and the current position (the distance along the z-axis), while heavily punishing the drone whenever its altitude deviates from the target by more than 2 meters. This yields a reward range between -1 and 1, except when the drone is more than 2 meters away from the hovering altitude, in which case the reward is -1000 and the episode is immediately terminated.
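A minimal sketch of that reward, with the shaping constants assumed from the description above (the function name is mine, not the task's):

```python
def hover_reward(z, target_z):
    """Reward shrinks linearly with vertical distance from the hover
    altitude, staying in [-1, 1] within a 2 m band; beyond 2 m the
    episode ends with a large penalty."""
    dist = abs(target_z - z)
    if dist > 2.0:
        return -1000.0, True   # crash penalty, terminate the episode
    return 1.0 - dist, False

print(hover_reward(10.0, 10.0))  # (1.0, False)
print(hover_reward(13.0, 10.0))  # (-1000.0, True)
```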
Question 2: Discuss your agent briefly, using the following questions as a guide:
Answer: The algorithm chosen for the task described above was the deep deterministic policy gradient (DDPG), an actor-critic algorithm. It was chosen because both the state space and the action space of the task are continuous.
The hyperparameters chosen for the model were the following:
The network architecture for the actor was as follows:
The network architecture for the critic was as follows:
An Add layer merges the state branch and the action branch, followed by a LeakyReLU activation function.
Both networks used an Adam optimizer with a learning rate of LR = 1e-6. The model also used a replay buffer of size 100,000 as memory, with a batch size of 128.
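The replay buffer can be sketched as follows (the sizes match the numbers above; the class layout itself is an assumption, patterned on common DDPG implementations):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple('Experience',
                        ['state', 'action', 'reward', 'next_state', 'done'])

class ReplayBuffer:
    """Fixed-size memory of experience tuples, sampled uniformly at random."""
    def __init__(self, buffer_size=100000, batch_size=128):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop off
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        return random.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```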
Question 3: Using the episode rewards plot, discuss how the agent learned over time.
Answer: The hovering task was extremely hard to achieve at first; after weeks of training, I decided to reduce the complexity of the task by shrinking the action size from 4 to 1. To make this change, the 4 rotors were set up to run at the same velocity, so that a single action controls all 4 rotors, thus reducing the problem to the z-axis.
After this reduction in complexity, the learning curve oscillated a little at the beginning; after around 130 episodes, the policy's performance ramped up suddenly to a policy close to optimal, and after another 200 episodes of oscillation the policy finally attained optimality. At the end, the performance reached a mean reward of around 240 over 10 episodes.
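That 10-episode mean can be computed from rewards_list with a small helper like this (the helper name is mine):

```python
import numpy as np

def trailing_mean(rewards, window=10):
    # Mean of the last `window` episode rewards, used to judge convergence.
    return float(np.mean(rewards[-window:]))

print(trailing_mean(list(range(100))))  # 94.5
```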
Question 4: Briefly summarize your experience working on this project. You can use the following prompts for ideas.
Answer: The hardest part of the project, without a doubt, was getting started, since there was no clear explanation of how the environment works; a visualization tool showing the drone flying in the environment would have been really useful. Specifying the task was the next challenge, and combined with my limited prior experience with the DDPG algorithm, training was hard and led to many failures. Identifying whether the task or the agent was at fault was difficult, so I used OpenAI Gym to test the algorithm, concluding that a reduction in task complexity was required.
In conclusion, I think the quadcopter is an extremely difficult system to control; at least from my point of view, it is necessary to divide the problem into subproblems in order to reduce the complexity for the RL algorithms and thus achieve a good policy for the task. I found this project fascinating, especially seeing how an RL algorithm can reach performance similar to what a control-theory algorithm could achieve.